Search | VHL Regional Portal

Authors' Reply: "Evaluating GPT-4's Cognitive Functions Through the Bloom Taxonomy: Insights and Clarifications".

Herrmann-Werner, Anne; Festl-Wietek, Teresa; Holderried, Friederike; Herschbach, Lea; Griewatz, Jan; Masters, Ken; Zipfel, Stephan; Mahling, Moritz.

J Med Internet Res ; 26: e57778, 2024 Apr 16.

Article in English | MEDLINE | ID: mdl-38625723

Subject(s)

Cognition , Humans

Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study.

Herrmann-Werner, Anne; Festl-Wietek, Teresa; Holderried, Friederike; Herschbach, Lea; Griewatz, Jan; Masters, Ken; Zipfel, Stephan; Mahling, Moritz.

J Med Internet Res ; 26: e52113, 2024 Jan 23.

Article in English | MEDLINE | ID: mdl-38261378

ABSTRACT

BACKGROUND: Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to "hallucinations" (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom's taxonomy. OBJECTIVE: This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions. METHODS: We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using a quantitative approach and a qualitative approach. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy. RESULTS: GPT-4's performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significant higher difficulty than questions answered incorrectly (P=.002 for the detailed prompt and P<.001 for the short prompt). Independent of the prompt, GPT-4's lowest exam performance was 78.9% (15/19), thereby always surpassing the "pass" threshold. Our qualitative analysis of incorrect answers, based on Bloom's taxonomy, showed that errors were primarily in the "remember" (29/68) and "understand" (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines. CONCLUSIONS: GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated through Bloom's taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.

Subject(s)

Education, Medical , Medicine , Psychosomatic Medicine , Humans , Research Design

A Generative Pretrained Transformer (GPT)-Powered Chatbot as a Simulated Patient to Practice History Taking: Prospective, Mixed Methods Study.

Holderried, Friederike; Stegemann-Philipps, Christian; Herschbach, Lea; Moldt, Julia-Astrid; Nevins, Andrew; Griewatz, Jan; Holderried, Martin; Herrmann-Werner, Anne; Festl-Wietek, Teresa; Mahling, Moritz.

JMIR Med Educ ; 10: e53961, 2024 Jan 16.

Article in English | MEDLINE | ID: mdl-38227363

ABSTRACT

BACKGROUND: Communication is a core competency of medical professionals and of utmost importance for patient safety. Although medical curricula emphasize communication training, traditional formats, such as real or simulated patient interactions, can present psychological stress and are limited in repetition. The recent emergence of large language models (LLMs), such as generative pretrained transformer (GPT), offers an opportunity to overcome these restrictions. OBJECTIVE: The aim of this study was to explore the feasibility of a GPT-driven chatbot to practice history taking, one of the core competencies of communication. METHODS: We developed an interactive chatbot interface using GPT-3.5 and a specific prompt including a chatbot-optimized illness script and a behavioral component. Following a mixed methods approach, we invited medical students to voluntarily practice history taking. To determine whether GPT provides suitable answers as a simulated patient, the conversations were recorded and analyzed using quantitative and qualitative approaches. We analyzed the extent to which the questions and answers aligned with the provided script, as well as the medical plausibility of the answers. Finally, the students filled out the Chatbot Usability Questionnaire (CUQ). RESULTS: A total of 28 students practiced with our chatbot (mean age 23.4, SD 2.9 years). We recorded a total of 826 question-answer pairs (QAPs), with a median of 27.5 QAPs per conversation and 94.7% (n=782) pertaining to history taking. When questions were explicitly covered by the script (n=502, 60.3%), the GPT-provided answers were mostly based on explicit script information (n=471, 94.4%). For questions not covered by the script (n=195, 23.4%), the GPT answers used 56.4% (n=110) fictitious information. Regarding plausibility, 842 (97.9%) of 860 QAPs were rated as plausible. Of the 14 (2.1%) implausible answers, GPT provided answers rated as socially desirable, leaving role identity, ignoring script information, illogical reasoning, and calculation error. Despite these results, the CUQ revealed an overall positive user experience (77/100 points). CONCLUSIONS: Our data showed that LLMs, such as GPT, can provide a simulated patient experience and yield a good user experience and a majority of plausible answers. Our analysis revealed that GPT-provided answers use either explicit script information or are based on available information, which can be understood as abductive reasoning. Although rare, the GPT-based chatbot provides implausible information in some instances, with the major tendency being socially desirable instead of medically plausible information.

Subject(s)

Communication , Students, Medical , Humans , Young Adult , Adult , Prospective Studies , Language , Medical History Taking

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL